Abstract

A DNA palindrome is a segment of double-stranded DNA sequence with inversion symmetry, which may form secondary structures conferring significant biological functions ranging from RNA transcription to DNA replication. Testing whether clusters of DNA palindromes are distributed randomly is an interesting bioinformatic problem, in which the occurrence rate of DNA palindromes is a key quantity for setting up a test. The most commonly used estimator of the occurrence rate for scan statistics is the average rate. However, in our simulation, the average rate may be double the null occurrence rate of DNA palindromes because of hot spot regions totalling 3000 bp in a herpes virus genome. Here, we propose a formula to estimate the occurrence rate through an analytic derivation under a Markov assumption on the DNA sequence. Our simulation study shows that this method improves accuracy and robustness against hot spots compared with the commonly used average rate. In addition, we derive analytical formulas for the moment-generating functions of various statistics under a Markov model, enabling further calculation of p-values.

1 Introduction

A chromosome is a long sequence of double-helix DNA whose two strands are held together by base pairs, either an adenine-thymine (A = T) pair or a cytosine-guanine (C = G) pair. Thus, one DNA strand determines the sequence of its complementary strand. A segment of DNA sequence with half length greater than or equal to a pre-specified length L is called a palindrome if one strand is identical to its complementary strand read in the reverse direction. DNA palindromes are common candidates in searches for genetic motifs involved in different cellular processes, including gene transcription, replication, and deletion. For example, among nine octamers suggested to be transcription factor binding sites, three are palindromes (FitzGerald et al., 2004). This may be attributed to their potential to form secondary genomic structures (Leach, 1994).

Many studies have compared the occurrence rates of palindromes in regions of interest against those in random sequences. For example, Lisnic and Svetec (2005) investigated the frequencies of palindromes in the yeast Saccharomyces cerevisiae genome according to the length and content of the palindromes. Chew et al. (2005) proposed three scoring schemes, based on occurrence rate, length, or likelihood, to quantify palindromes, and found an association between high-score regions and replication origins. Lu et al. (2007) reported that meaningful sites tend to have higher palindrome scores, by comparing the scores over regions including introns, exons, and upstream regions of transcription start sites against simulated random sequences.

The performance of these comparison tests depends strongly on how accurately the occurrence rate is estimated for the random sequence. This rate is usually estimated by the average rate of palindromes over the genome-wide sequence. Another approach is the iid-model-based estimator, for which a formula has been derived in terms of the estimated DNA letter frequencies (Chew et al., 2005). However, we observed obvious discrepancies between these two estimates in various herpes virus genomes. For example, on the BHV1CGEN (BoHV1) sequence, the average rate is 0.00166, whereas the iid model estimates the rate as 0.00073. While the average rate might be biased by hot spot regions, the iid model might be too naive to describe the DNA sequence. In this paper, we provide a formula to calculate the occurrence rate under a Markov model, of which the iid model is a special case. For the BoHV1 case, our method estimates the rate as 0.00098. Simulations are designed to check the performance of the estimates of the null occurrence rate, both with and without hot spot segments in the random sequences. The results show that our method performs better than the average rate in estimating the null occurrence rate in the presence of hot spot regions.

Chan and Zhang (2007) developed a method to approximate the p-value of scan statistics for weighted Poisson processes, which can be applied to DNA palindrome problems. Their approach requires an analytic formula for the moment generating function (MGF) of the palindrome score. However, the distribution of palindrome scores has not been well studied, except for the length score under the iid assumption. We therefore develop a method to derive the analytic formula for the MGF of various scores under a Markov model. Furthermore, this analytic formula allows us to calculate the overshoot term in the p-value approximation.

This paper is organized as follows. In Section 2, we first show that three commonly used scores proposed by Chew et al. (2005) can be derived by a likelihood approach. Second, we show that the occurrence rates can be calculated accurately under a Markov model by constructing a quasi transition matrix T. Third, we derive the moment generating function for various scores under the Markov model. Finally, we give a p-value approximation with a more precise calculation of the overshoot term. In Section 3, we present numerical studies on both real and simulated data. The paper ends with a brief discussion.

2 Method 

2.1 Notations and Log Likelihood Ratio Statistics 

Let $N(t)$ be a counting process describing the occurrence of palindromes, and let $N_w(t) = N(t+w) - N(t)$ denote the number of events in the interval $(t, t+w]$. Leung et al. (2005) proved that $N(t)$ can be approximated by a Poisson process under a Markov model. Let $X_i$ be the score of the $i$-th event along the genome sequence. $S_{N_w}(t)$ is the sum of the palindrome scores inside the interval $(t, t+w]$, which can be expressed as

$$S_{N_w}(t) = \sum_{i=N(t)+1}^{N(t+w)} X_i. \tag{1}$$
To search for clusters of palindromes, Chew et al. (2005) proposed three schemes for scoring palindromes in order to predict replication origins in herpes viruses: the palindrome count score (PCS), the palindrome length score (PLS), and the base-pair weighted score of order $m$ ($BWS_m$). PCS gives a score of one to each DNA palindrome; PLS gives the palindrome length divided by its minimum required length; and $BWS_m$ gives the negative log-likelihood under a Markov model of order $m$.

We would like to show that both $N_w(t)$ and $S_{N_w}(t)$ are equivalent to log-likelihood ratio statistics when the alternative hypotheses are properly constructed. Under the Poisson process model, the $X_i$'s can be treated as iid with density function $f_\theta(x) = f_0(x)\exp(\theta x - \phi(\theta))$, where $f_0(x)$ is an unknown distribution and $\phi(\theta) = \log \int e^{\theta x} f_0(x)\,dx$. The parameters for $N(t)$ and $X_i$ are $(\lambda_a, \theta_a)$ for events occurring in the interval $(t_a, t_a + w]$ and $(\lambda_0, \theta_0)$ otherwise; the null hypothesis is $\lambda_a = \lambda_0$ and $\theta_a = \theta_0$. When $t_a$ is known, the likelihood ratio is $f_{\lambda_a,\theta_a}(N_w(t_a), S_{N_w}(t_a)) / f_{\lambda_0,\theta_0}(N_w(t_a), S_{N_w}(t_a))$, where the likelihood is

$$f_{\lambda,\theta}(N_w(t), S_{N_w}(t)) = f_\lambda(N_w(t))\, f_\theta(S_{N_w}(t) \mid N_w(t)).$$

Because $t_a$ is usually unknown, we search for the maximum of the statistic over all possible $t$.

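Case 1. If the alternative hypothesis is constructed as $H_a: \lambda_a = \lambda_1 > \lambda_0$ and $\theta_a = \theta_0$, then the log-likelihood ratio statistic is equivalent to PCS in Chew et al. (2005), as shown below:

$$\max_t l_t(\lambda_1, \theta_0) = \max_t \log\left(\frac{f_{\lambda_1,\theta_0}(N_w(t), S_{N_w}(t))}{f_{\lambda_0,\theta_0}(N_w(t), S_{N_w}(t))}\right) = \max_t \left\{ N_w(t) \log\left(\frac{\lambda_1}{\lambda_0}\right) - (\lambda_1 - \lambda_0) w \right\}. \tag{2}$$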

Case 2. If the alternative hypothesis is constructed as $H_a: \lambda_a = \lambda_1 > \lambda_0$ and $\theta_a = \theta_1 > \theta_0$, where $\lambda_1$ and $\theta_1$ satisfy the constraint

$$\log\left(\frac{\lambda_1}{\lambda_0}\right) - (\phi(\theta_1) - \phi(\theta_0)) = 0, \tag{3}$$

the log-likelihood ratio statistic in formula (4) is equivalent to PLS or $BWS_m$ proposed by Chew et al. (2005), depending on the definition of the $X_i$'s:

$$\max_t l_t(\lambda_1, \theta_1) = \max_t \log\left(\frac{f_{\lambda_1,\theta_1}(N_w(t), S_{N_w}(t))}{f_{\lambda_0,\theta_0}(N_w(t), S_{N_w}(t))}\right) = \max_t \left\{ -(\lambda_1 - \lambda_0) w + (\theta_1 - \theta_0) S_{N_w}(t) \right\}. \tag{4}$$

It can be observed that (2) is equivalent to $\max_t N_w(t)$ and (4) is equivalent to $\max_t S_{N_w}(t)$. While (2) only tests the Poisson parameter $\lambda$, (4) tests both the Poisson parameter $\lambda$ and the score parameter $\theta$ under the constraint (3). Note that $N_w(t)$ can be treated as a special case of $S_{N_w}(t)$ with $X_i = 1$ for each $i$.
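
Both reductions can be checked with a single expansion. The following is a sketch that assumes, as the Poisson model above implies, that $N_w(t)$ is Poisson with mean $\lambda w$ and that, given $N_w(t)$, the scores in the window are iid from $f_\theta$:

$$\log\frac{f_{\lambda_1,\theta_1}(N_w(t), S_{N_w}(t))}{f_{\lambda_0,\theta_0}(N_w(t), S_{N_w}(t))} = N_w(t)\left[\log\frac{\lambda_1}{\lambda_0} - (\phi(\theta_1) - \phi(\theta_0))\right] - (\lambda_1 - \lambda_0) w + (\theta_1 - \theta_0) S_{N_w}(t).$$

Setting $\theta_1 = \theta_0$ removes the $\phi$ and $S_{N_w}(t)$ terms, which leaves the count-based statistic of Case 1. Imposing the constraint (3) makes the bracket vanish, which leaves the score-based statistic of Case 2.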

Chan and Zhang (2007) developed an approximation method to calculate the p-value of scan statistics on a weighted Poisson process, which can be applied to derive the threshold value for (1) if the log-MGF $\phi(\theta)$ of $X_i$ is properly formulated. Let $N(t)$ be a Poisson process with rate $\lambda_0$ and let the $X_i$'s be iid with mean $\mu_0$; then

$$P_0\left( \max_{0 < s < W} S_{N_w}(s) \ge b \right) \approx 1 - \exp\left\{ -(W - w)\, \nu_{\lambda_1,\theta_1} (b - \lambda_0 \mu_0)\, e^{-[b(\theta_1 - \theta_0) - w(\lambda_1 - \lambda_0)]} (2\pi w \lambda_1 \phi''(\theta_1))^{-1/2} \right\}, \tag{5}$$

where $W$ is the total length of the sequence, $\nu_{\lambda_1,\theta_1}$ is an overshoot correction term, and $\theta_1$ and $\lambda_1$ satisfy the equations

$$\lambda_1 \phi'(\theta_1) = b,$$

$$\log(\lambda_1/\lambda_0) = \phi(\theta_1) - \phi(\theta_0).$$

Whether $N_w(t)$ or $S_{N_w}(t)$ is used in testing the null hypothesis, $\lambda_0$ always plays a crucial role. If $\lambda_0$ is seriously overestimated, the test becomes too conservative and loses power. Conversely, if $\lambda_0$ is seriously underestimated, the test fails to control the false positive rate.

2.2 Occurrence rate of DNA palindromes under Markov model 

The average rate is a commonly used estimator for the null parameter of scan statistics. Yet, in various herpes virus genomes, the average rate is positively biased by some hot spot regions. On the other hand, the iid model may not describe the DNA sequence well, since it ignores the correlation between adjacent DNA letters. We therefore develop a method to calculate the occurrence rate of palindromes under a Markov model. We construct a matrix $T$ with $T_{ij} = P_{a_i a_j} P_{\bar a_j \bar a_i}$, which groups together the transition probabilities of symmetric complementary pairs. For example, AG pairs with CT at its mirror site, which leads to the definition $T_{13} = P_{AG} P_{CT}$. We call $T$ a quasi transition matrix because its rows do not sum to one.

Theorem 1 Assume that the DNA letters along the genome sequence follow a Markov model with transition probabilities $\{P_{ab} \mid a, b \in \{A, C, G, T\}\}$ and letter frequencies $P_0' = (\pi_A\ \pi_C\ \pi_G\ \pi_T)$. Then the occurrence probability, at a given starting position, of a palindrome with half length greater than or equal to $L$ is

$$\lambda_M = P(\|I\| \ge L) = P_0' T^{L-1} P_1, \tag{6}$$

where $I$ describes the palindromic pattern at the given starting position and $\|I\|$ denotes the corresponding maximum length,

$$P_1 = (P_{AT}\ P_{CG}\ P_{GC}\ P_{TA})',$$

and

$$T = \begin{pmatrix}
P_{AA}P_{TT} & P_{AC}P_{GT} & P_{AG}P_{CT} & P_{AT}P_{AT} \\
P_{CA}P_{TG} & P_{CC}P_{GG} & P_{CG}P_{CG} & P_{CT}P_{AG} \\
P_{GA}P_{TC} & P_{GC}P_{GC} & P_{GG}P_{CC} & P_{GT}P_{AC} \\
P_{TA}P_{TA} & P_{TC}P_{GA} & P_{TG}P_{CA} & P_{TT}P_{AA}
\end{pmatrix}.$$
Proof: The event that a DNA palindrome has half length greater than or equal to $L$ is equivalent to the event that the center $2L$ letters follow a palindromic pattern. A sequence $a_1 a_2 \cdots a_{2L}$ of length $2L$ is a palindrome if $a_{L+k} = \bar a_{L-k+1}$ for $1 \le k \le L$, where $\bar a_j$ denotes the complementary letter of $a_j$. Then, under a Markov model, we can sum the probability over all possible letters and obtain $\lambda_M$:

$$\begin{aligned}
\lambda_M &= P(\|I\| \ge L) \\
&= \sum_{\substack{a_i \in \{A,C,G,T\} \\ 1 \le i \le L}} \pi_{a_1} P_{a_1 a_2} \cdots P_{a_{L-1} a_L} P_{a_L \bar a_L} P_{\bar a_L \bar a_{L-1}} \cdots P_{\bar a_2 \bar a_1} \\
&= \sum_{\substack{a_i \in \{A,C,G,T\} \\ 1 \le i \le L}} \pi_{a_1} (P_{a_1 a_2} P_{\bar a_2 \bar a_1}) \cdots (P_{a_{L-1} a_L} P_{\bar a_L \bar a_{L-1}}) P_{a_L \bar a_L}.
\end{aligned} \tag{7}$$

Here $P_{a_i a_{i+1}}$ is the transition probability from letter $a_i$ to letter $a_{i+1}$, and $T$ is the matrix form of $(P_{a_1 a_2} P_{\bar a_2 \bar a_1})$. Equation (7) can be viewed as a matrix multiplication: a row vector multiplies a matrix raised to the power $L-1$, and the result multiplies a column vector. This technique is used repeatedly in this paper, including in the proof of Theorem 3.
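
As a minimal numerical sketch of Theorem 1 (not the authors' code), the quasi transition matrix and the rate in (6) can be evaluated as below. The function names `quasi_transition` and `palindrome_rate`, and the letter order A, C, G, T used for indexing, are assumptions made for illustration.

```python
import numpy as np

# Letters are indexed in the order A, C, G, T (an assumed convention), so the
# complement of letter i (A<->T, C<->G) has index 3 - i.
COMP = np.array([3, 2, 1, 0])


def quasi_transition(P):
    """Return T with T[i, j] = P[a_i, a_j] * P[comp(a_j), comp(a_i)]."""
    P = np.asarray(P, dtype=float)
    # P[np.ix_(COMP, COMP)][r, c] = P[comp(r), comp(c)]; transposing swaps r, c.
    return P * P[np.ix_(COMP, COMP)].T


def palindrome_rate(pi, P, L):
    """Evaluate lambda_M = P_0' T^(L-1) P_1 from equation (6)."""
    pi = np.asarray(pi, dtype=float)
    P = np.asarray(P, dtype=float)
    T = quasi_transition(P)
    # P_1 holds the centre transitions P_AT, P_CG, P_GC, P_TA.
    P1 = P[np.arange(4), COMP]
    return float(pi @ np.linalg.matrix_power(T, L - 1) @ P1)
```

When every row of `P` equals the letter frequencies (the iid model), `palindrome_rate` reduces to the $L$-th power of $2(\pi_A\pi_T + \pi_C\pi_G)$, which gives a simple sanity check of the implementation.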
Remark 1: When the Markov model is reduced to the iid model, $P_1$ becomes

$$P_2 = (\pi_T\ \pi_G\ \pi_C\ \pi_A)'$$

and $T$ becomes $P_2 P_0'$. Thus,

$$P(\|I\| \ge L) = P_0' (P_2 P_0')^{L-1} P_2 = (P_0' P_2)^L = \gamma^L, \tag{8}$$

where $\gamma = 2(\pi_A \pi_T + \pi_C \pi_G)$. Equation (8) has been shown in Leung et al. (2005).
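
As a quick arithmetic check of the iid formula, assume the letter frequencies reported for the BoHV1 sequence in Section 3 and the half length $L = 6$ used there:

$$\gamma = 2(0.1354 \times 0.1404 + 0.3588 \times 0.3654) \approx 0.3002, \qquad \gamma^{6} \approx 0.00073.$$

This agrees with the iid-model estimate of 0.00073 quoted in the Introduction.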

Theorem 2 With the same assumption as in Theorem 1, the PLS score for the $i$-th palindrome is defined as $X_i = \|I_i\| / L$ conditional on $\|I_i\| \ge L$, where $L$ is the minimum half length for the palindrome. Then, the MGF for $X_i$ is

$$K_{PLS}(t) = E\left(e^{X_i t} \mid \|I_i\| \ge L\right) = \frac{e^{t}}{\lambda_M} P_0' T^{L-1} [I - e^{t/L} T]^{-1} [I - T] P_1. \tag{9}$$

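Proof of Theorem 2

$$\begin{aligned}
E\left(e^{X_i t} \mid \|I_i\| \ge L\right)
&= \sum_{k=L}^{\infty} e^{kt/L} \left[ P(\|I_i\| \ge k) - P(\|I_i\| \ge k+1) \right] / P(\|I_i\| \ge L) \\
&= P_0' \sum_{k=L}^{\infty} e^{kt/L}\, T^{k-1} (I - T) P_1 / \lambda_M \\
&= \frac{e^{t}}{\lambda_M} P_0' T^{L-1} [I - e^{t/L} T]^{-1} [I - T] P_1.
\end{aligned} \tag{10}$$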

Remark 2: When the Markov model is reduced to the iid model,

$$K_{PLS}(t) = \sum_{k=L}^{\infty} e^{kt/L}\, \gamma^{k-L} (1 - \gamma) = \frac{e^{t}(1 - \gamma)}{1 - e^{t/L}\gamma}.$$
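
The matrix form of the PLS moment generating function in Theorem 2 can be evaluated numerically. The sketch below is illustrative only: the function name `pls_mgf` and the A, C, G, T index order are assumptions, and the explicit convergence check (the geometric series needs $e^{t/L}$ times the spectral radius of $T$ to be below one) is an added safeguard rather than part of the paper.

```python
import numpy as np

COMP = np.array([3, 2, 1, 0])  # complement index for the assumed order A, C, G, T


def pls_mgf(t, pi, P, L):
    """MGF of the PLS score X = ||I|| / L given ||I|| >= L (Theorem 2)."""
    pi = np.asarray(pi, dtype=float)
    P = np.asarray(P, dtype=float)
    T = P * P[np.ix_(COMP, COMP)].T           # quasi transition matrix
    P1 = P[np.arange(4), COMP]                # centre transitions P_AT, P_CG, ...
    TL = np.linalg.matrix_power(T, L - 1)
    lam = pi @ TL @ P1                        # lambda_M from Theorem 1
    s = np.exp(t / L)
    # The series sum_k (s T)^k converges only if s * rho(T) < 1.
    if s * np.max(np.abs(np.linalg.eigvals(T))) >= 1:
        raise ValueError("t is too large: the series defining the MGF diverges")
    eye = np.eye(4)
    # Solve (I - s T) x = (I - T) P_1 instead of forming the inverse explicitly.
    tail = np.linalg.solve(eye - s * T, (eye - T) @ P1)
    return float(np.exp(t) * (pi @ TL @ tail) / lam)
```

At $t = 0$ the function returns 1, as any MGF must, and under the iid model it matches the closed form $e^{t}(1-\gamma)/(1 - e^{t/L}\gamma)$.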

Theorem 3 With the same assumption as in Theorem 1, the BWS score is defined as $X_i = -\log(P(I_i))$ conditional on $\|I_i\| \ge L$. Then, the MGF for $X_i$ is

$$K_{BWS}(t) = E\left[e^{X_i t} \mid \|I_i\| \ge L\right] = \frac{1}{\lambda_M} v'(t) [I - Q(t)]^{-1} [Q(t)]^{L-1} u(t), \tag{11}$$

where $v(t) = (v_1(t)\ v_2(t)\ v_3(t)\ v_4(t))'$ is defined by $v_i(t) = ([(I - T')P_0]_i)^{1-t}$; $Q(t)$ is defined by $Q_{ij}(t) = (T_{ij})^{1-t}$; and $u(t) = (u_1(t)\ u_2(t)\ u_3(t)\ u_4(t))'$ is defined by $u_i(t) = ([P_1]_i)^{1-t}$, with $i = 1, \cdots, 4$.
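Proof of Theorem 3

$$P(I_i = a_1 \cdots a_k \bar a_k \cdots \bar a_1,\ \|I_i\| = k) = \Big( \pi_{a_1} - \sum_{a_0 \in \{A,C,G,T\}} \pi_{a_0} P_{a_0 a_1} P_{\bar a_1 \bar a_0} \Big) P_{a_1 a_2} \cdots P_{a_{k-1} a_k} P_{a_k \bar a_k} P_{\bar a_k \bar a_{k-1}} \cdots P_{\bar a_2 \bar a_1}.$$

Thus, we have

$$\begin{aligned}
K(t, k) &= E\left[e^{X_i t};\ \|I_i\| = k\right] = E\left[\{P(I_i \text{ occurs})\}^{-t};\ \|I_i\| = k\right] \\
&= \sum_{\substack{a_j \in \{A,C,G,T\} \\ 1 \le j \le k}} \Big\{ \Big( \pi_{a_1} - \sum_{a_0 \in \{A,C,G,T\}} \pi_{a_0} P_{a_0 a_1} P_{\bar a_1 \bar a_0} \Big) P_{a_1 a_2} \cdots P_{a_{k-1} a_k} P_{a_k \bar a_k} P_{\bar a_k \bar a_{k-1}} \cdots P_{\bar a_2 \bar a_1} \Big\}^{1-t} \\
&= \sum_{\substack{a_j \in \{A,C,G,T\} \\ 1 \le j \le k}} \Big( \pi_{a_1} - \sum_{a_0 \in \{A,C,G,T\}} \pi_{a_0} P_{a_0 a_1} P_{\bar a_1 \bar a_0} \Big)^{1-t} (P_{a_1 a_2} P_{\bar a_2 \bar a_1})^{1-t} \times \cdots \times (P_{a_{k-1} a_k} P_{\bar a_k \bar a_{k-1}})^{1-t} (P_{a_k \bar a_k})^{1-t} \\
&= v'(t) [Q(t)]^{k-1} u(t).
\end{aligned} \tag{12}$$

Then, taking the sum over $k = L$ to $\infty$ and dividing by $\lambda_M$ leads to (11).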
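Remark 3

When the Markov model is reduced to the iid model, (12) becomes

$$K(t, k) = (1 - \gamma)^{1-t} P_0'(t) \big(P_2(t) P_0'(t)\big)^{k-1} P_2(t) = (1 - \gamma)^{1-t} \big(P_0'(t) P_2(t)\big)^{k} = (1 - \gamma)^{1-t} \gamma_t^{k},$$

where $P_0'(t) = (\pi_A^{1-t}\ \pi_C^{1-t}\ \pi_G^{1-t}\ \pi_T^{1-t})$, $P_2'(t) = (\pi_T^{1-t}\ \pi_G^{1-t}\ \pi_C^{1-t}\ \pi_A^{1-t})$, and $\gamma_t = P_0'(t) P_2(t) = 2[(\pi_A \pi_T)^{1-t} + (\pi_C \pi_G)^{1-t}]$. So, for the iid model,

$$K(t) = \frac{(1 - \gamma)^{1-t}}{1 - \gamma_t} \left( \frac{\gamma_t}{\gamma} \right)^{L}. \tag{13}$$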
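
A numerical sketch of the BWS moment generating function in Theorem 3 follows. It is illustrative rather than the authors' implementation. It assumes that $v_i(t)$ is the first factor in the proof of Theorem 3, $\pi_{a_1} - \sum_{a_0} \pi_{a_0} P_{a_0 a_1} P_{\bar a_1 \bar a_0}$, raised to the power $1-t$, that `pi` is the stationary distribution of `P` (so this factor is positive), and that letters are indexed in the order A, C, G, T. The name `bws_mgf` is hypothetical.

```python
import numpy as np

COMP = np.array([3, 2, 1, 0])  # complement index for the assumed order A, C, G, T


def bws_mgf(t, pi, P, L):
    """MGF of the BWS score X = -log P(I) given ||I|| >= L (Theorem 3)."""
    pi = np.asarray(pi, dtype=float)
    P = np.asarray(P, dtype=float)
    T = P * P[np.ix_(COMP, COMP)].T           # quasi transition matrix
    P1 = P[np.arange(4), COMP]                # centre transitions
    lam = pi @ np.linalg.matrix_power(T, L - 1) @ P1
    # Assumption: "no further extension" factor from the proof,
    # pi[a1] - sum_a0 pi[a0] * T[a0, a1], raised element-wise to 1 - t.
    v = (pi - T.T @ pi) ** (1.0 - t)
    Q = T ** (1.0 - t)                        # element-wise power of T
    u = P1 ** (1.0 - t)
    # The sum over k >= L converges only if the spectral radius of Q is < 1.
    if np.max(np.abs(np.linalg.eigvals(Q))) >= 1:
        raise ValueError("t is too large: the series defining the MGF diverges")
    QL = np.linalg.matrix_power(Q, L - 1)
    return float(v @ np.linalg.solve(np.eye(4) - Q, QL @ u) / lam)
```

Under the iid model the result can be compared with the closed form $(1-\gamma)^{1-t}(\gamma_t/\gamma)^L/(1-\gamma_t)$, and at $t = 0$ the function returns 1.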
Remark 4

The conditional process involved in the overshoot term of the p-value approximation can be approximated by a partial sum of iid copies of

$$y = -\sum_{k=1}^{N(\Delta)} x_k + \sum_{k=1}^{N^*(\Delta)} x_k^*,$$

where $N(\cdot)$ and $N^*(\cdot)$ are independent Poisson processes with rates $\lambda_0$ and $\lambda_1$, and the $x_k$'s and $x_k^*$'s are independent random variables with density functions $f_{\theta_0}$ and $f_{\theta_1}$. The derivation is given in the appendix. By the same method as in Theorem 3 and Theorem 4, the characteristic function of $y$ can be derived. Applying Theorem 1 in Tu (2009), the overshoot term can then be calculated.



3 Real Data Analyses and Simulations 

We studied 27 herpesvirus genome sequences from the database of EBI Nucleotide Sequences. For each sequence, we estimated the transition matrix and the stationary probabilities of the DNA letters {A,C,G,T}. Theorem 1 is applied to estimate the null occurrence rate for each sequence. These results are compared with the corresponding average rates in Figure 1. The average rates are consistently higher.

[Figure 1: plot of the estimated null occurrence rate for DNA palindromes across the 27 herpes virus genomic sequences, with two series: Average and Markov.]

Figure 1: 27 herpes virus genomic sequences were downloaded from the database of EBI Nucleotide Sequences. Two methods for estimating the null palindrome rates are presented: the average rate and the Markov-model-based estimator. We adopted the abbreviations used in Leung et al. (2005) for naming the genome sequences.
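
The estimation step described above (a transition matrix and letter probabilities for each genome) can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: the function name `estimate_markov` is hypothetical, the empirical letter frequencies stand in for the stationary probabilities, and characters other than A, C, G, T are simply dropped, which treats their neighbours as adjacent.

```python
import numpy as np

LETTERS = "ACGT"


def estimate_markov(seq):
    """Estimate letter frequencies and first-order transition probabilities."""
    idx = {a: i for i, a in enumerate(LETTERS)}
    # Encode the sequence as integers 0..3, skipping any other symbol.
    codes = np.array([idx[c] for c in seq.upper() if c in idx])
    # Empirical letter frequencies, used here as the estimate of pi.
    pi = np.bincount(codes, minlength=4) / len(codes)
    # Count adjacent pairs (a_i, a_{i+1}) and normalise each row.
    counts = np.zeros((4, 4))
    np.add.at(counts, (codes[:-1], codes[1:]), 1)
    rows = counts.sum(axis=1, keepdims=True)
    # Rows with no observed transitions are left as zeros.
    P = np.divide(counts, rows, out=np.zeros_like(counts), where=rows > 0)
    return pi, P
```

The returned `pi` and `P` are the inputs required by the formula of Theorem 1.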
We also checked the accuracy of these two methods through numerical simulation. Because a real DNA sequence may contain meaningful DNA codes that contribute to its non-randomness, random sequences are generated to fit the null hypothesis. All the parameters involved in generating the random sequences, including the stationary probabilities $\pi$ and the transition matrix $P$, are estimated from the BoHV1 sequence. The BoHV1 sequence, with sequence ID BHV1CGEN, contains 135301 bases. The state probabilities are estimated as

$$\pi = (0.1354\ (A),\ 0.3588\ (C),\ 0.3654\ (G),\ 0.1404\ (T))$$

and the transition probabilities are

$$P = \begin{array}{c|cccc}
 & A & C & G & T \\ \hline
A & 0.1854 & 0.3288 & 0.3556 & 0.1303 \\
C & 0.1258 & 0.2932 & 0.4347 & 0.1463 \\
G & 0.1343 & 0.4512 & 0.2994 & 0.1151 \\
T & 0.1141 & 0.3151 & 0.3695 & 0.2012
\end{array}$$



The half length $L = 6$ is adopted as the criterion for a palindrome event. Palindrome events along these random sequences can be well approximated by a homogeneous Poisson process. Note that, in this case, the average rate $\hat\lambda$ is the maximum likelihood estimator (MLE) of the occurrence rate. Our simulation shows that both methods estimate the rate well; see the first numerical row of Table 1.
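
The null simulation described above can be sketched in a few lines. The code below is an illustration, not the authors' simulation code: the function names are hypothetical, letters are coded 0..3 in the assumed order A, C, G, T, probabilities are renormalised to guard against rounding in published values, and a palindrome event is counted once per centre whose surrounding $2L$ letters are reverse-complementary.

```python
import numpy as np

COMP = np.array([3, 2, 1, 0])  # complement index for the assumed order A, C, G, T


def simulate_markov(n, pi, P, rng):
    """Draw n letters (coded 0..3) from a first-order Markov chain."""
    pi = np.asarray(pi, dtype=float)
    P = np.asarray(P, dtype=float)
    pi = pi / pi.sum()                         # guard against rounding error
    P = P / P.sum(axis=1, keepdims=True)
    seq = np.empty(n, dtype=int)
    seq[0] = rng.choice(4, p=pi)
    for i in range(1, n):
        seq[i] = rng.choice(4, p=P[seq[i - 1]])
    return seq


def count_palindromes(seq, L):
    """Count centres whose 2L surrounding letters form a DNA palindrome."""
    seq = np.asarray(seq)
    n = len(seq)
    count = 0
    for c in range(L, n - L + 1):              # centre lies between c-1 and c
        left = seq[c - L:c][::-1]              # letters moving outwards to the left
        right = seq[c:c + L]                   # letters moving outwards to the right
        if np.array_equal(right, COMP[left]):
            count += 1
    return count


def average_rate(seq, L):
    """Average rate: number of palindrome events divided by sequence length."""
    return count_palindromes(seq, L) / len(seq)
```

For example, `average_rate(simulate_markov(135301, pi, P, np.random.default_rng(0)), 6)` gives one simulated value of the average rate under the null model, which can be compared with the model-based rate from Theorem 1.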

The validity of the average rate as a null parameter estimator rests on the assumption that the number of events from non-random clusters is much smaller than the total number of events. However, this assumption may not hold for a real DNA sequence. It has been observed that meaningful sites in the sequence tend to have higher palindrome rates, so the average rate usually overestimates the null occurrence rate. Here, we design a simulation experiment to check the robustness of the estimates against hot spot regions.

For each random sequence, we insert three hot spot segments of length 1000 base pairs at different positions. The inserted segments contain palindromes that are randomly resampled from the palindrome bank, which collects all the DNA palindromes from the BoHV1 sequence. We assigned three occurrence rates to the three segments as $\lambda_i = a_i \lambda_0$, $1 \le i \le 3$, where $\lambda_0 = .00098$ is estimated by the Markov model for the BoHV1 sequence and the $a_i$'s quantify the intensities of the hot spots. The simulation results for various combinations of $(\lambda_1, \lambda_2, \lambda_3)$, based on 500 repeats, are presented in Table 1. The estimator based on the model calculation increases by less than 8%, while the estimator based on the average rate almost doubles, when the occurrence rates in the hot spot regions increase 30-fold.

$(a_1, a_2, a_3)$    $\hat\lambda$ (average)    $\hat\lambda_M$ (Markov)
1  1  1              .001078                    .001099
10 10 10             .001402                    .001110
10 10 20             .001515                    .001113
10 20 20             .001643                    .001117
20 20 20             .001739                    .001142
30 30 30             .002105                    .001135

Table 1: Two methods for estimating the null occurrence rate of palindrome sequences are compared when non-random clusters exist. For each random sequence, three non-random clusters are inserted with adjustable occurrence rates: $\lambda_i = a_i \lambda_0$, $1 \le i \le 3$, with $\lambda_0 = .00098$. The first row, with $a_1 = a_2 = a_3 = 1$, corresponds to a completely random sequence with no hot spots.
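
The relative changes can be read directly from Table 1 by comparing the $(30, 30, 30)$ row with the $(1, 1, 1)$ row:

$$\frac{.002105}{.001078} \approx 1.95 \ \text{(average rate)}, \qquad \frac{.001135}{.001099} \approx 1.03 \ \text{(Markov estimate)}.$$

The largest Markov-based entry in the table, .001142 for $(20, 20, 20)$, is about 3.9% above its value in the first row.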

Overestimating the occurrence rate increases the threshold value of the hypothesis test and leads to a loss of power. The simulation for the power comparisons in Table 2 is designed in the same way as that of Table 1. Table 2 shows the powers for detecting each of the three hot spot regions of DNA palindromes. We applied the PLS and BWS scores with a window size of 1000 bp to scan the whole genome. The calculation of the threshold values follows Chan and Zhang (2007) on weighted scan statistics, with a modification of the overshoot term that is described in the appendix of this paper. Here, power is defined as the frequency of detecting a hot spot region based on 500 replicates. Table 2 shows that $\lambda_M$ can gain more than 50% in power over $\hat\lambda$ when power is not saturated.

PLS
                     $\hat\lambda$ (average)                     $\lambda_M$ (Markov)
$(a_1, a_2, a_3)$    Threshold   Power                           Threshold   Power
(1,1,1)              8.9063      0.0000  0.0000  0.0000          9.0061      0.0000  0.0000  0.0000
(7,7,7)              9.6221      0.2100  0.2025  0.2275          9.0399      0.2975  0.2900  0.2900
(10,10,10)           9.9477      0.4550  0.5075  0.4800          9.0496      0.5825  0.6250  0.6325
(10,10,20)           10.3013     0.4300  0.5100  0.9875          9.0686      0.5950  0.6575  0.9950
(10,20,20)           10.6435     0.3825  0.9900  0.9775          9.0877      0.6350  0.9975  0.9975
(20,20,20)           11.0216     0.9675  0.9850  0.9850          9.1014      0.9900  0.9925  0.9975

BWS
                     $\hat\lambda$ (average)                     $\lambda_M$ (Markov)
$(a_1, a_2, a_3)$    Threshold   Power                           Threshold   Power
(1,1,1)              114.4505    0.0000  0.0000  0.0000          115.7137    0.0000  0.0000  0.0000
(7,7,7)              123.2021    0.1950  0.2425  0.2625          115.9571    0.2700  0.3250  0.3200
(10,10,10)           127.5283    0.5150  0.5325  0.5525          116.0439    0.6650  0.6650  0.6800
(10,10,20)           130.8699    0.4575  0.4625  0.9800          116.1847    0.6425  0.6325  0.9925
(10,20,20)           133.7581    0.4100  0.9850  0.9775          116.1572    0.6125  1.0000  0.9975
(20,20,20)           140.2448    0.9825  0.9825  0.9750          116.3187    0.9950  1.0000  0.9925

Table 2: Powers are compared for using $\hat\lambda$ and $\lambda_M$ to estimate the null occurrence rates of DNA palindromes when hot spot regions are inserted. $\hat\lambda$ tends to be too conservative because it overestimates the occurrence rates.
4 Discussion

The average rate is a popular estimator of the null occurrence rate for scan statistics. In this paper, we show through an example that it does not always work: in the herpesvirus genome simulation, the average rate can overestimate the null occurrence rate by a factor of about two. We further propose a model-based estimator, which avoids directly counting the number of events in hot spot regions. Our method estimates the Markov parameters instead of estimating the occurrence rate directly.

Hot spot regions can contribute a large portion of the total number of events, especially when the null occurrence rate is very low. On the other hand, when estimating the transition probabilities and the stationary state probabilities under the Markov model, the hot spots have little influence provided their size is much smaller than the total length of the genome. This explains why $\lambda_M$ is not sensitive to the hot spot effect. Our study suggests that the average rate should be used with care for null parameter estimation, especially when the process involves rare events with hot spot regions, a situation that is quite common in epidemiological studies of rare diseases.



5 Appendix 

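Chan and Zhang (2007) have provided a p-value approximation for the scan statistics of marked Poisson processes. Here, we provide a more general formula for calculating the overshoot term for various distributions of $X_i$. Let $N$ be a Poisson process with constant rate $\lambda_0 > 0$ and let the random variables $X_i \sim f_{\theta_0}(\cdot)$. Let $\lambda_1$ and $\theta_1$ satisfy two conditions: (a) $\lambda_1 \phi'(\theta_1) = b$; (b) $\log(\lambda_1/\lambda_0) - (\phi(\theta_1) - \phi(\theta_0)) = 0$. Then we have the following theorem.

Theorem 4 Let $b \to \infty$ as $w \to \infty$ such that $W - w \to \infty$. Then

$$P_0\left( \max_{0 < s < W} S_{N_w}(s) \ge b \right) \approx 1 - \exp\left\{ -(W - w)\, \nu_{\lambda_1,\theta_1} (b - \lambda_0 \mu_0)\, e^{-I(b) w} (2\pi w \lambda_1 \phi''(\theta_1))^{-1/2} \right\},$$

where $\nu_{\lambda_1,\theta_1}$ is the overshoot correction term.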



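Proof of Theorem 4

Assume that the process is observed on the set $\{t_j \mid t_j = j\Delta,\ 0 < t_j < W\}$, where $\Delta = o(w)$. Then we have the inequality

$$P\left( \max_{0 < j\Delta < W} S_{N_w}(j\Delta) \ge b \right) \le P\left( \max_{0 < s < W} S_{N_w}(s) \ge b \right) \le P\left( \max_{0 < j\Delta < W} S_{N_{w+\Delta}}(j\Delta) \ge b \right).$$

It can be shown that $P(\max_j S_{N_{w+\Delta}}(j\Delta) \ge b)$ converges when $w$ converges to a constant, such that $\lim_{\Delta \to 0} P(\max_j S_{N_w}(j\Delta) \ge b) = P(\max_s S_{N_w}(s) \ge b)$. In fact, in this study, if we let $W$ be the total number of DNA base pairs, then $\Delta$ equals 1 instead of converging to 0.

First, we decompose the probability by conditioning on the last time $t_L = \sup\{j \mid S_{N_w}(j\Delta) \ge b\}$, as used in Woodroofe (1979):

$$\begin{aligned}
P\left( \max_{0 \le j\Delta < W} S_{N_w}(j\Delta) \ge b \right)
&= \sum_{j=0}^{\lfloor (W-w)/\Delta \rfloor} P(t_L = j) \\
&= \sum_{j=0}^{\lfloor (W-w)/\Delta \rfloor} P\left( \max_{j < s \le \lfloor (W-w)/\Delta \rfloor} S_{N_w}(s\Delta) < b,\ S_{N_w}(j\Delta) \ge b \right) \\
&\approx \frac{W - w}{\Delta} \sum_{k=0}^{\infty} P\left( \max_{0 < j < \infty} S_{N_w}(j\Delta) < b,\ S_{N_w}(0) = b + k \right).
\end{aligned}$$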

This approximation technique can be found in Tu and Siegmund (1999). We apply the new measure $Q$ introduced in Chan and Zhang (2007), under which $N$ is a nonhomogeneous Poisson process with rate $\lambda_1$ on $(0, w]$ and rate $\lambda_0$ on $(w, W]$, $X_i \sim f_{\theta_1}(\cdot)$ for $1 \le i \le N(w)$, and $X_i \sim f_{\theta_0}(\cdot)$ for $N(w) < i \le N(W)$. By (a) and (b),

$$\frac{dQ}{dP}\{N_w, x_1, \ldots, x_{N(w)}\} = \exp\left( S_{N_w}(0)(\theta_1 - \theta_0) - (\lambda_1 - \lambda_0) w \right).$$

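By change of measure, we have

$$\begin{aligned}
&\sum_{k=0}^{\infty} P\left( \max_{0 < i < \infty} S_{N_w}(i\Delta) < b,\ S_{N_w}(0) = b + k \right) \\
&\quad = \sum_{k=0}^{\infty} E_Q\left[ \frac{dP}{dQ}\, 1\Big\{ \max_{0 < i < \infty} S_{N_w}(i\Delta) < b,\ S_{N_w}(0) = b + k \Big\} \right] \\
&\quad = \sum_{k=0}^{\infty} e^{-I(b) w - k(\theta_1 - \theta_0)}\, Q\left( \max_{0 < i < \infty} S_{N_w}(i\Delta) - S_{N_w}(0) < -k \,\Big|\, S_{N_w}(0) = b + k \right) Q\left( S_{N_w}(0) = b + k \right),
\end{aligned}$$

where $I(b) = b(\theta_1 - \theta_0)/w - (\lambda_1 - \lambda_0)$.

By the local CLT,

$$Q\left( S_{N_w}(0) = b + k \right) \approx [2\pi w \lambda_1 \phi''(\theta_1)]^{-1/2}.$$

Let $\{N^*(t), x_1^*, \ldots, x^*_{N^*(t)}\}$ be independent of $\{N(t), x_1, \ldots, x_{N(t)}\}$, where $N^*(t)$ is a Poisson process with rate $\lambda_1$ and $x_i^*$ is distributed as $f_{\theta_1}(\cdot)$; let $w$ and $b$ be large enough such that

$$Q\left( \max_{0 < j < \infty} S_{N_w}(j\Delta) - S_{N_w}(0) < -k \,\Big|\, S_{N_w}(0) = b + k \right) \approx P\left( \min_{0 < j < \infty} \Big( -\sum_{i=1}^{N(j\Delta)} x_i + \sum_{i=1}^{N^*(j\Delta)} x_i^* \Big) > k \right).$$

Let $y_1 = -\sum_{k=1}^{N(\Delta)} x_k + \sum_{k=1}^{N^*(\Delta)} x_k^*$, and let $y_2, y_3, \ldots$ be iid copies of $y_1$. By (8.13) in Siegmund (1985), we have

$$P\left( \min_{0 < n < \infty} S_n > k \right) = \frac{P(S_{\tau_+} > k)}{E_Q \tau_+}, \quad \text{where } S_n = \sum_{i=1}^{n} y_i \text{ and } \tau_+ = \inf\{n : S_n > 0\}.$$

Since $\sum_{k=0}^{\infty} e^{-k(\theta_1 - \theta_0)} P(S_{\tau_+} > k)$ can be expressed as $\{1 - E e^{-S_{\tau_+}(\theta_1 - \theta_0)}\} / (1 - e^{-(\theta_1 - \theta_0)})$, we have

$$\sum_{k=0}^{\infty} P\left( \max_{0 < j < \infty} S_{N_w}(j\Delta) < b,\ S_{N_w}(0) = b + k \right) \approx \Delta\, \nu_{\lambda_1,\theta_1} (b - \lambda_0 \mu_0)\, e^{-I(b) w} (2\pi w \lambda_1 \phi''(\theta_1))^{-1/2}.$$

Therefore,

$$P\left( \max_{0 < s < W} S_{N_w}(s) \ge b \right) \approx 1 - \exp\left\{ -(W - w)\, \nu_{\lambda_1,\theta_1} (b - \lambda_0 \mu_0)\, e^{-I(b) w} (2\pi w \lambda_1 \phi''(\theta_1))^{-1/2} \right\}.$$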

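By Theorem 1 of Tu (2009), the overshoot term $\nu_{\lambda_1,\theta_1}$ can be calculated when the characteristic function $E e^{ity_1}$ is known. Let $\varphi(t) = E e^{itx_1}$. We have

$$E\Big[\exp\Big\{ -it \sum_{j=1}^{N(\Delta)} x_j \Big\}\Big] = \sum_{k=0}^{\infty} e^{-\lambda_0 \Delta} \frac{(\lambda_0 \Delta)^k}{k!}\, \varphi(-t)^k = e^{\lambda_0 \Delta (\varphi(-t) - 1)}$$

and

$$E\Big[\exp\Big\{ it \sum_{j=1}^{N^*(\Delta)} x_j^* \Big\}\Big] = E_Q\Big[\exp\Big\{ it \sum_{j=1}^{N(\Delta)} x_j \Big\}\Big] = E\Big[\frac{dQ}{dP} \exp\Big\{ it \sum_{j=1}^{N(\Delta)} x_j \Big\}\Big] = \exp\left\{ -\lambda_1 \Delta + \lambda_0 \Delta\, \varphi(t - i(\theta_1 - \theta_0)) \right\}.$$

So $E e^{ity_1}$ is derived.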